Building a highly resilient multi-region database using CockroachDB
In the media and streaming industry, downtime is simply not acceptable. From the infamous Game of Thrones outages, to celebrities “breaking the internet,” to fans missing critical moments of live sporting events, these types of scenarios really upset consumers.
Mux specializes in delivering a platform for developers to build high-quality live and on-demand video streaming experiences. They are focused on delivering a great experience for their impressive list of customers, which means they need a reliable, fault-tolerant infrastructure.
Since 2018, Mux has been building on CockroachDB and leverages a multi-region and multi-cloud setup to achieve high availability and fast performance for Mux Video. More recently, they built a new internal service on CockroachDB that allowed them to consolidate several signing keys systems into a unified global system. Not only does this internal service improve their own developers’ productivity, it also creates a better end-user experience by responding to distributed requests in real time.
Mux products use signing keys to validate requests like signed playback URLs and real-time viewer counts. Previously, different Mux products used separate implementations of signing keys, in part because they were based in different regions and/or clouds.
However, as Mux built more products that required signing keys, this approach became undesirable and hampered the customer experience. For example, the need for separate keys per product grew confusing and caused unnecessary engineering maintenance. Mux needed to unify their separate signing keys systems to solve this issue.
In the planning phases for this consolidation project, the team considered using PostgreSQL (which offers strong data consistency) as the primary data store, with edge CDNs deployed to cache keys. However, this approach would have required more engineering time and added the complexity of managing data replication and networking themselves.
“This product use case had very specific requirements around data consistency, latency, and availability. CockroachDB offered a battle-tested solution that met those requirements and would allow us to easily scale horizontally and expand into different regions and clouds.”
– Faith Szeto, Software Engineer at Mux
Instead, they chose to migrate all of their keys to CockroachDB to have one global service. CockroachDB fit their use case, which required data consistency, low latency, and multi-region, multi-cloud capabilities. From start to finish, the development of this project took around 8 months.
Today, the service is used by Mux applications to perform signing keys lookups and writes. These keys are used for things like verifying that a playback URL is authorized to stream an asset, or allowing participants to join a Mux Real-Time Space. They store public keys in a single table, with an average of 5 queries per second (QPS).
For their setup, Mux deploys a CockroachDB cluster with Kubernetes, self-hosted on AWS and GCP. A Go microservice fronts the database and is used by other applications to access signing key information via gRPC and HTTP.
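To make that pattern concrete, here is a minimal sketch, not Mux’s actual code, of a Go service fronting CockroachDB for key lookups. The signing_keys table and its id and public_key columns are assumptions for illustration, and the endpoint is plain HTTP/JSON rather than gRPC for brevity:

```go
// Minimal sketch of a key-lookup service fronting CockroachDB over HTTP.
// The signing_keys table and its columns are hypothetical, not Mux's schema.
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"os"

	_ "github.com/jackc/pgx/v5/stdlib" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// e.g. "postgresql://svc@local-crdb:26257/keys?sslmode=verify-full" (placeholder DSN)
	db, err := sql.Open("pgx", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/signing-keys/", func(w http.ResponseWriter, r *http.Request) {
		keyID := r.URL.Path[len("/signing-keys/"):]

		var publicKey string
		err := db.QueryRowContext(r.Context(),
			`SELECT public_key FROM signing_keys WHERE id = $1`, keyID,
		).Scan(&publicKey)
		if err == sql.ErrNoRows {
			http.NotFound(w, r)
			return
		} else if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		json.NewEncoder(w).Encode(map[string]string{"id": keyID, "public_key": publicKey})
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Because CockroachDB is wire-compatible with PostgreSQL, the service can use a standard Postgres driver and ordinary SQL; the database handles replication behind the scenes.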
An instance of the service (and its co-located CockroachDB nodes) exists in multiple Kubernetes clusters, each serving local requests with low latency. They receive most of their writes through one region and rely on CockroachDB to propagate the data to all the other regions. Since Mux cannot assume that a key’s use will be limited to any one region, they use CockroachDB’s GLOBAL table locality to make all keys accessible in every region.
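In CockroachDB, this is expressed as a table locality. As a rough illustration rather than Mux’s actual configuration, assuming a multi-region database named keys and placeholder region names, the one-time setup could be run from Go like so:

```go
// Sketch of one-time multi-region setup, assuming a database named "keys"
// and placeholder region names. SET LOCALITY GLOBAL is the CockroachDB
// feature referenced above: it keeps the table readable with low latency
// from every region.
package main

import (
	"context"
	"database/sql"
	"log"
	"os"

	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	db, err := sql.Open("pgx", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}

	stmts := []string{
		// Declare the database's regions (names here are placeholders).
		`ALTER DATABASE keys SET PRIMARY REGION "us-east1"`,
		`ALTER DATABASE keys ADD REGION "us-west1"`,
		`ALTER DATABASE keys ADD REGION "europe-west1"`,
		// Make the signing keys table readable everywhere at low latency.
		`ALTER TABLE signing_keys SET LOCALITY GLOBAL`,
	}
	for _, s := range stmts {
		if _, err := db.ExecContext(context.Background(), s); err != nil {
			log.Fatalf("%s: %v", s, err)
		}
	}
}
```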
Applications that would usually connect to their local CockroachDB region can fail over to an external region if there is a cloud or regional outage and still guarantee data correctness. This extremely fault-tolerant setup ensures that Mux’s service is always on and available.
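A simplified sketch of that failover behavior, with placeholder connection strings standing in for the local and remote CockroachDB endpoints, might look like this:

```go
// Sketch of region failover: prefer the local CockroachDB region, fall back
// to another region if the local one is unreachable. DSNs are placeholders.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib"
)

// connectWithFallback tries each DSN in order and returns the first one
// that answers a ping within the timeout.
func connectWithFallback(ctx context.Context, dsns []string) (*sql.DB, error) {
	var lastErr error
	for _, dsn := range dsns {
		db, err := sql.Open("pgx", dsn)
		if err != nil {
			lastErr = err
			continue
		}
		pingCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		err = db.PingContext(pingCtx)
		cancel()
		if err == nil {
			return db, nil // healthy connection; data stays consistent either way
		}
		lastErr = err
		db.Close()
	}
	return nil, lastErr
}

func main() {
	db, err := connectWithFallback(context.Background(), []string{
		"postgresql://svc@crdb.local-region:26257/keys?sslmode=verify-full",  // local region
		"postgresql://svc@crdb.remote-region:26257/keys?sslmode=verify-full", // cross-region fallback
	})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	log.Println("connected")
}
```

Because every region serves the same consistent data, falling back to a remote region trades some latency but never correctness.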
“We are able to achieve data consistency and replication across multiple clouds and regions, all without managing that complexity ourselves. This is important to Mux’s commitment to providing multi-region, multi-cloud capabilities, along with consistent high availability. We are also able to horizontally scale our database with load, and expand into new regions without introducing more complexity to our replication model.”
—Faith Szeto, Software Engineer at Mux
Now that they have set up the CockroachDB cluster and proven its global capabilities in production, Mux plans to expand on this distributed database model. So far, they have started migrating other databases that contain globally relevant information, like feature flags and organizational context, into CockroachDB.
Previously, this data was limited to certain regions and was not easily accessible to all of Mux’s Kubernetes clusters. Mux engineers report that since most of the legwork in setting up their cluster is done, it will be easy to migrate other databases into this cluster as well.
Ultimately, their customer-facing applications will already be set up to connect, and Mux won’t have to deal with the additional latency and consistency issues that come with accessing data externally. All of these aspects lead to a better end-user and developer experience.
You can take a look at the final design of the Mux architecture and see all the details in their blog post about getting critical data everywhere at the same time.
To learn more about Mux’s other use cases, click here. Special thanks to Faith Szeto (Software Engineer @ Mux) for providing the information for this post.